System and method for controlling application of an error correction code (ECC) algorithm in a memory subsystem

ABSTRACT

A method for controlling application of an erasure mode of an error correction code (ECC) algorithm in a memory subsystem includes detecting errors in cache lines retrieved from the memory subsystem using the ECC algorithm. The method also analyzes the errors to detect a repeated bit pattern of data corruption within the cache lines, correlates the detected repeated bit pattern of data corruption to one of a plurality of domains of the memory subsystem, and applies the ECC algorithm to erase bits associated with the detected repeated bit pattern from cache lines retrieved from the correlated domain of the memory subsystem.

CROSS-REFERENCE TO RELATED APPLICATION(S)

The present application is related to U.S. patent application Ser. No. 10/435,150, filed May 9, 2003, entitled “SYSTEMS AND METHODS FOR PROCESSING AN ERROR CORRECTION CODE WORD FOR STORAGE IN MEMORY COMPONENTS,” which is incorporated herein by reference; this application is also related to concurrently filed and commonly assigned U.S. patent application Ser. No. 10/879,262, entitled “SYSTEM AND METHOD FOR CONTROLLING APPLICATION OF AN ERROR CORRECTION CODE (ECC) ALGORITHM IN A MEMORY SUBSYSTEM,” and U.S. patent application Ser. No. 10/879,643, entitled “SYSTEM AND METHOD FOR APPLYING ERROR CORRECTION CODE (ECC) ERASURE MODE AND CLEARING RECORDED INFORMATION FROM A PAGE DEALLOCATION TABLE,” which are incorporated herein by reference.

DESCRIPTION OF RELATED ART

Electronic data storage utilizing commonly available memories (such as dynamic random access memory (DRAM)) can be problematic. Specifically, there is a probability that, when data is stored in memory and subsequently retrieved, the retrieved data will suffer some corruption. For example, DRAM stores information in relatively small capacitors that may suffer a transient corruption due to a variety of mechanisms. Additionally, data corruption may occur as the result of hardware failures such as loose memory modules, blown chips, wiring defects, and/or the like. The errors caused by such failures are referred to as repeatable errors, since the same physical mechanism repeatedly causes the same pattern of data corruption.

A variety of error detection and error correction mechanisms have been developed to mitigate the effects of data corruption. For example, error detection and correction algorithms may be embedded in a number of components in a computer system to address data corruption. Frequently, ECC algorithms are embedded in memory controllers such as coherent memory controllers in distributed shared memory architectures.

In general, error detection algorithms employ redundant data added to a string of data. The redundant data is calculated utilizing a check-sum or cyclic redundancy check (CRC) operation. When the string of data and the original redundant data are retrieved, the redundant data is recalculated utilizing the retrieved data. If the recalculated redundant data does not match the original redundant data, data corruption in the retrieved data is detected.
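By way of illustration only, the store-and-recheck scheme described above may be sketched in Python using the standard library CRC-32 routine. The function names `store` and `is_intact` are hypothetical and are not part of any described embodiment.

```python
import zlib


def store(data: bytes) -> tuple[bytes, int]:
    """Return the data together with redundant data (a CRC-32 value)."""
    return data, zlib.crc32(data)


def is_intact(retrieved: bytes, stored_crc: int) -> bool:
    """Recalculate the redundant data and compare it to the stored value."""
    return zlib.crc32(retrieved) == stored_crc


data, crc = store(b"cache line payload")
corrupted = bytes([data[0] ^ 0x01]) + data[1:]  # flip one bit of the first byte
```

In this sketch a mismatch only reveals that corruption occurred; it does not identify or repair the corrupted bits.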

Error correction code (ECC) algorithms operate in a manner similar to error detection algorithms. When data is stored, redundant data is calculated and stored in association with the data. When the data and the redundant data are subsequently retrieved, the redundant data is recalculated and compared to the retrieved redundant data. When an error is detected (e.g., the original and recalculated redundant data do not match), the original and recalculated redundant data may be used to correct certain categories of errors. An example of a known ECC scheme is described in “Single Byte Error Correcting-Double Byte Error Detecting Codes for Memory subsystems” by Shigeo Kaneda and Eiji Fujiwara, published in IEEE TRANSACTIONS on COMPUTERS, Vol. C31, No. 7, July 1982.

SUMMARY

In one embodiment of the invention, a computer readable medium, comprising executable instructions for controlling application of an error correction code (ECC) algorithm in a memory subsystem, comprises code for recording occurrences of data corruption in data retrieved from the memory subsystem, code for analyzing the occurrences of data corruption to detect a repeated bit pattern of data corruption across different addresses of the memory subsystem, and code for controlling application of the ECC algorithm to erase bits associated with a repeated bit pattern, detected by the code for analyzing, from data retrieved from the memory subsystem.

In another embodiment of the invention, a method for controlling application of an erasure mode of an error correction code (ECC) algorithm in a memory subsystem, comprises detecting errors in cache lines retrieved from the memory subsystem using the ECC algorithm, analyzing the errors to detect a repeated bit pattern of data corruption within the cache lines, correlating the detected repeated bit pattern of data corruption to one of a plurality of domains of the memory subsystem, and applying the ECC algorithm to erase bits associated with the detected repeated bit pattern from cache lines retrieved from the correlated domain of the memory subsystem.

In another embodiment of the invention, a system comprises memory means for storing data, memory controller means for storing cache lines in and retrieving cache lines from the memory means, wherein the memory controller means applies an error correction code (ECC) algorithm to the cache lines to erase predetermined bit locations within the cache lines, means for recording instances of data corruption in cache lines detected by the memory controller means, means for differentiating the instances of data corruption according to transient errors, repeatable errors associated with a memory bank, repeatable errors associated with a memory rank, repeatable errors associated with a bus, and repeatable errors associated with all of the memory means, and means for activating an erasure mode for the ECC algorithm for a domain of the memory means in response to the means for differentiating.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a memory subsystem that performs data storage using a selectively enabled erasure mode according to one representative embodiment.

FIG. 2 depicts a computer system employing a software algorithm that records and analyzes memory errors to control the selective activation of erasure mode processing according to one representative embodiment.

FIG. 3 depicts a flowchart for analyzing data corruption to selectively enable an erasure mode according to one representative embodiment.

FIG. 4 depicts a flowchart for analyzing data corruption according to one representative embodiment.

DETAILED DESCRIPTION

Embodiments of the present invention are directed to employing an ECC algorithm within a memory subsystem to provide increased reliability of the memory subsystem. In one representative embodiment, the ECC algorithm enables multiple “single-byte” errors to be corrected within a single cache line. A single-byte error refers to corruption of any number of bits within eight adjacent bits of a cache line aligned according to eight-bit boundaries. The correctable errors may be transient single-byte errors. Moreover, representative embodiments enable correction of repeatable errors within a single cache line in addition to the correction of transient errors. The repeatable errors may be caused by a failing DRAM part, a memory interconnect malfunction, a memory interface logic malfunction, and/or the like. The correction of a repeatable error occurs according to an “erasure” mode. “Erasing” refers to decoding an ECC code word by assuming that an identified bit or bits are corrupted. The erasure mode is activated by loading a register in a memory controller with a suitable value to identify the location of the repeatable error.

Activating the erasure mode carries costs. The additional processing associated with the erasure mode causes memory transactions to consume additional time. Also, the probability of decoding an uncorrectable error as correctable is increased due to the mathematical properties of the ECC algorithm. Even though the increased probability is relatively small, the probability is not insignificant in relatively large memory subsystems.

In one representative embodiment, a software algorithm maintains a record of data corruption detected by the ECC algorithm to enable selective activation of the erasure mode. The software algorithm analyzes the occurrences of data corruption to identify repeated bit patterns. If a repeated bit pattern is identified, the software algorithm correlates the occurrence of data corruption to a particular “domain” (a particular level or a particular component) of the memory subsystem. If a sufficient number of occurrences of data corruption have been detected as originating from the particular domain according to the same pattern of data corruption, the software algorithm activates the erasure mode for the domain of the memory subsystem. The software algorithm may activate the erasure mode by setting appropriate registers of the memory controller of the memory subsystem. The memory subsystem responds by decoding ECC code words from the domain of the memory subsystem by assuming that the identified bits within the ECC code words are corrupted.
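By way of illustration only, the threshold-based activation described above may be sketched as follows. The class name `ErasureModeTracker`, the threshold of three occurrences, and the dictionary standing in for the memory controller registers are all assumptions made for this sketch; the text above does not specify what number of occurrences is “sufficient.”

```python
from collections import Counter

ACTIVATION_THRESHOLD = 3  # hypothetical value; not specified in the text above


class ErasureModeTracker:
    """Counts occurrences of a corruption pattern per domain."""

    def __init__(self, threshold: int = ACTIVATION_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()   # (domain, byte_position) -> occurrences
        self.active = {}          # domain -> erased byte position (stand-in for registers)

    def record(self, domain: str, byte_position: int) -> bool:
        """Record one occurrence; return True if erasure mode is newly activated."""
        key = (domain, byte_position)
        self.counts[key] += 1
        if self.counts[key] >= self.threshold and domain not in self.active:
            self.active[domain] = byte_position
            return True
        return False
```

A single occurrence at a given byte position leaves the erasure mode inactive, which reflects the intent that isolated transient errors not trigger the more costly decoding mode.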

Referring now to the drawings, FIG. 1 depicts memory subsystem 100 that performs data storage using a selectively enabled erasure mode according to one representative embodiment. Memory subsystem 100 includes memory controller 101 (e.g., a cache coherency controller). Memory controller 101 manages the storage and retrieval of cache lines to and from the hierarchical arrangement of memory components in memory subsystem 100. Specifically, memory subsystem 100 includes a plurality of memory quadrants 105 that are accessible by respective buses 104. As shown in FIG. 1, each memory quadrant 105 includes two DRAM buses 106 (shown collectively as 106-1 through 106-8) to enable access to eight memory ranks 107 (shown collectively as memory ranks 107-1 through 107-32). Each rank 107 includes a plurality of discrete DRAM banks (not shown) as is well known in the art. The plurality of ranks 107 may be implemented by two dual-in-line memory modules (DIMMs). In one representative embodiment, a cache line is stored across a respective rank 107 to facilitate correction of single-byte errors.

Memory controller 101 includes ECC logic 103 to append ECC redundancy bits to cache lines during storage and to utilize the ECC redundancy bits to perform error detection and correction upon retrieval of cache lines. The ECC redundancy bits may be used to address transient errors. Also, the ECC redundancy bits may be used to address repeatable errors. Specifically, malfunctions of various components may cause repeatable errors for selected memory addresses and have no effect on other memory addresses. For example, a wire within DRAM bus 106-1 may exhibit intermittent failure. Cache lines retrieved from ranks 107-1 through 107-4 will exhibit, from time to time, a repeated error for the bit associated with the failing wire. However, cache lines retrieved from ranks 107-5 through 107-32 will not experience a corresponding error at the same bit location. Registers 102 of controller 101 are used by ECC logic 103 to apply the erasure mode of the ECC algorithm to data retrieved from the specific portion of memory subsystem 100 affected by a detected component failure. Hereinafter, the term “domain” shall be used to refer to any portion of the memory subsystem to which the erasure mode ECC processing may be applied independently of the remaining portion of the memory subsystem.
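By way of illustration only, the relationship between ranks, DRAM buses, and quadrants in the example above may be sketched as follows. The sketch assumes that ranks 107-1 through 107-32 are assigned to DRAM buses 106-1 through 106-8 in sequential groups of four, which is consistent with the example of bus 106-1 serving ranks 107-1 through 107-4 but is otherwise an assumption. The function names are hypothetical.

```python
RANKS_PER_DRAM_BUS = 4       # e.g., ranks 107-1..107-4 on DRAM bus 106-1
DRAM_BUSES_PER_QUADRANT = 2  # two DRAM buses 106 per quadrant 105
TOTAL_RANKS = 32


def locate_rank(rank: int) -> tuple[int, int]:
    """Map a rank index (1..32) to (quadrant index, DRAM bus index)."""
    if not 1 <= rank <= TOTAL_RANKS:
        raise ValueError("rank index out of range")
    dram_bus = (rank - 1) // RANKS_PER_DRAM_BUS + 1
    quadrant = (dram_bus - 1) // DRAM_BUSES_PER_QUADRANT + 1
    return quadrant, dram_bus


def ranks_on_dram_bus(dram_bus: int) -> list[int]:
    """List the rank indices affected by a fault on one DRAM bus."""
    first = (dram_bus - 1) * RANKS_PER_DRAM_BUS + 1
    return list(range(first, first + RANKS_PER_DRAM_BUS))
```

Under these assumptions, a failing wire on the first DRAM bus defines a domain of four ranks, and the remaining twenty-eight ranks can continue to be decoded without the erasure mode.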

To correct repeatable errors according to an erasure mode in addition to transient errors, ECC logic 103 may utilize a suitable Reed-Solomon burst error correction code to perform single-byte correction. In Reed-Solomon algorithms, the code word consists of n m-bit numbers: C=(c_(n−1), c_(n−2), . . . , c_0). The code word may be represented mathematically by the following polynomial of degree n−1 with the coefficients (symbols) being elements in the finite Galois field GF(2^(m)): C(x)=(c_(n−1)x^(n−1)+c_(n−2)x^(n−2)+ . . . +c_0). The code word is generated utilizing a generator polynomial (typically denoted by g(x)). Specifically, the payload data (denoted by u(x)) is encoded with the generator polynomial, i.e., C(x)=x^(n−k)u(x)+[x^(n−k)u(x)mod(g(x))] for systematic coding. Systematic coding causes the original payload bits to appear explicitly in defined positions of the code word. The original payload bits are represented by x^(n−k)u(x) and the redundancy information is represented by [x^(n−k)u(x)mod(g(x))].

When the code word is subsequently retrieved from memory, the retrieved code word may suffer data corruption due to a transient failure and/or a repeatable failure. The retrieved code word is represented by the polynomial r(x). If r(x) includes data corruption, r(x) differs from C(x) by an error signal e(x). The redundancy information is recalculated from the retrieved code word. The original redundancy information as stored in memory and the newly calculated redundancy information are combined utilizing an exclusive-or (XOR) operation to form the syndrome polynomial s(x). The syndrome polynomial is also related to the error signal. Using this relationship, several algorithms may determine the error signal and thus correct the errors in the corrupted data represented by r(x). These techniques include error-locator polynomial determination, root finding for determining the positions of error(s), and error value determination for determining the correct bit-pattern of the error(s). For additional details related to recovery of the error signal e(x) from the syndrome s(x) according to Reed-Solomon burst error correction codes, the reader is referred to THE ART OF ERROR CORRECTING CODES by Robert H. Morelos-Zaragoza, pages 33-72 (2002), which is incorporated herein by reference.
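By way of illustration only, systematic encoding and the XOR-based syndrome calculation described above may be sketched as follows. The field polynomial 0x11d, the generator roots α^1 through α^3, the choice of three redundancy symbols, and all function names are assumptions made for this sketch. The sketch computes a syndrome only and does not perform error location or correction.

```python
# GF(2^8) log/antilog tables; the primitive polynomial 0x11d is an assumption.
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]

NSYM = 3  # number of redundancy symbols (n - k), chosen for illustration


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply polynomials (highest-degree coefficient first)."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out


def generator(nsym: int) -> list[int]:
    """g(x) = (x + alpha^1)(x + alpha^2)...(x + alpha^nsym); monic."""
    g = [1]
    for i in range(1, nsym + 1):
        g = poly_mul(g, [1, EXP[i]])
    return g


G = generator(NSYM)


def redundancy(payload: list[int]) -> list[int]:
    """Compute x^(n-k) u(x) mod g(x) by polynomial long division."""
    rem = list(payload) + [0] * NSYM
    for i in range(len(payload)):
        coef = rem[i]
        if coef:
            for j in range(1, len(G)):
                rem[i + j] ^= gf_mul(G[j], coef)
    return rem[-NSYM:]


def encode(payload: list[int]) -> list[int]:
    """Systematic code word: payload symbols followed by redundancy symbols."""
    return list(payload) + redundancy(payload)


def syndrome(retrieved: list[int]) -> list[int]:
    """XOR the stored redundancy with redundancy recalculated from the payload."""
    stored = retrieved[-NSYM:]
    recalculated = redundancy(retrieved[:-NSYM])
    return [a ^ b for a, b in zip(stored, recalculated)]
```

In this sketch an all-zero syndrome indicates that no corruption was detected, while any non-zero symbol indicates that the retrieved code word differs from a valid code word.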

Erasures in error correction codes are specific bits or specific strings of bits that are known to be potentially corrupted without resorting to the ECC functionality. For example, specific bits may be identified as being potentially corrupted due to a constant or intermittent hardware failure such as a malfunctioning DRAM component, a wire defect, and/or the like. Introduction of erasures into the ECC algorithm is advantageous, because the positions of the potentially corrupted bits are known. Let d represent the minimum distance of a code, v represent the number of errors, and μ represent the number of erasures contained in a received ECC code word. Then, the minimum Hamming distance between code words is reduced to at least d−μ in the non-erased portions. It follows that the error-correcting capability is ⌊(d−μ−1)/2⌋ and the following relation is maintained: d>2v+μ. Specifically, this inequality demonstrates that for a fixed minimum distance, it is twice as “easy” to correct an erasure as it is to correct a randomly positioned error.

In one representative embodiment, ECC logic 103 of memory controller 101 may implement the decoding procedure of a [36, 33, 4] shortened narrow-sense Reed-Solomon code (where the code word length is 36 symbols, the payload length is 33 symbols, and the minimum Hamming distance is 4 symbols) over the finite Galois field GF(2⁸). The finite Galois field defines the symbol length to be 8 bits. By adapting ECC logic 103 in this manner, the error correction may occur in two distinct modes. In a first mode, ECC logic 103 performs single-byte correction. In the second mode (the erasure mode), a byte location (or locations) is specified in the ECC code word as an erasure via a register setting. The location is identified by a software or firmware process as a repeatable error caused by a hardware failure. ECC logic 103 decodes the retrieved data by assuming that the single byte associated with the identified erasure is corrupted. Because the minimum Hamming distance is reduced, ECC logic 103 enables the entire cache line to be recovered even when another (e.g., a transient) single-byte error is present in addition to the erasure error.
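By way of illustration only, the relation d>2v+μ may be checked numerically for a code of minimum distance 4. The function names below are hypothetical.

```python
def is_correctable(d: int, errors: int, erasures: int) -> bool:
    """True when d > 2v + mu, i.e., the error/erasure combination is correctable."""
    return d > 2 * errors + erasures


def max_random_errors(d: int, erasures: int) -> int:
    """Error-correcting capability floor((d - mu - 1) / 2), never below zero."""
    return max((d - erasures - 1) // 2, 0)


D = 4  # minimum distance of the [36, 33, 4] code
```

With d=4, one randomly positioned single-byte error is correctable with no erasures (4>2), and one such error remains correctable alongside one erased byte (4>3). Two randomly positioned errors (4>4 fails) or one error alongside two erasures (4>4 fails) are not correctable.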

Additional details regarding a hardware implementation of the ECC algorithm employing a selectively enabled erasure mode in a memory subsystem may be found in U.S. patent application Ser. No. 10/435,150 entitled “SYSTEMS AND METHODS FOR PROCESSING AN ERROR CORRECTION CODE WORD FOR STORAGE IN MEMORY COMPONENTS.”

As previously discussed, data corruption in a memory subsystem may result from a number of causes. Most frequently, the cause of data corruption is a particle strike. A particle strike involves the transfer of energy to a DRAM element thereby changing the state of the DRAM element and corrupting the bit associated with the DRAM element. A particle strike is a random occurrence and, hence, falls within the “transient error” characterization. A particle strike does not indicate that any hardware component is malfunctioning and the appropriate response generally involves correction of the data corruption. DRAM vendors estimate that discrete DRAM elements exhibit an error rate of 5000 to 15000 failures in time (FIT), typically measured in failures per billion device hours. Using 10,000 FIT as an average, a single DIMM can be expected to experience a transient error once every 114 days. In a memory subsystem with 32 DIMMs, approximately 100 errors can be expected per year. Accordingly, the observation of approximately 100 randomly occurring errors per year at random locations in a memory subsystem is not a cause for concern.
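By way of illustration only, the arithmetic behind the figures above may be sketched as follows. The number of DRAM devices per DIMM is not stated in the text; the value of 36 used here is an assumption chosen because it yields an interval close to the stated 114 days. The function names are hypothetical.

```python
FIT_BASE_HOURS = 1e9          # one FIT = one failure per billion device hours
HOURS_PER_YEAR = 24 * 365

FIT_PER_DEVICE = 10_000       # average figure used in the text
DEVICES_PER_DIMM = 36         # assumption; not stated in the text
DIMMS = 32


def days_between_errors(fit_per_device: float, devices: int) -> float:
    """Expected days between transient errors across a group of devices."""
    hours = FIT_BASE_HOURS / (fit_per_device * devices)
    return hours / 24


def errors_per_year(fit_per_device: float, devices: int) -> float:
    """Expected number of transient errors per year across a group of devices."""
    return HOURS_PER_YEAR * fit_per_device * devices / FIT_BASE_HOURS


per_dimm_days = days_between_errors(FIT_PER_DEVICE, DEVICES_PER_DIMM)
subsystem_errors = errors_per_year(FIT_PER_DEVICE, DEVICES_PER_DIMM * DIMMS)
```

Under the stated assumption, the sketch gives roughly 116 days between errors for a single DIMM and roughly 101 errors per year for 32 DIMMs, in line with the approximate figures above.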

Some representative embodiments maintain records of errors to determine whether errors can be characterized as transient errors or repeatable errors to control the selective activation of erasure mode ECC processing. FIG. 2 depicts computer system 200 employing a software algorithm that records and analyzes memory errors to control the selective activation of erasure mode processing according to one representative embodiment. Computer system 200 includes a plurality of processors 201 that store and retrieve cache lines using memory subsystem 100. When data corruption occurs upon the retrieval of a cache line, memory controller 101 detects the error and temporarily stores information related to the error (e.g., the physical memory address and corrupted bits/bytes).

From time to time, error analysis algorithm 203 stored in system firmware 202 (or other suitable non-volatile memory or computer readable medium) is executed by a processor 201. Error analysis algorithm 203 polls memory controller 101 to obtain the information related to detected occurrences of data corruption. In response, error analysis algorithm 203 records the occurrences in error records 204. Error records 204 contain suitable information to enable repeated errors to be detected such as the bit location(s) exhibiting the error(s), the memory addresses of the error(s), the buses, the memory ranks, the DRAM banks used to communicate a corrupted cache line, and/or the like. When a repeated error is detected, error analysis algorithm 203 sets respective registers of memory controller 101 to erase the bits associated with the repeated pattern for the domain of memory subsystem 100 that generated the repeated error.

FIG. 3 depicts a flowchart for analyzing data corruption to selectively enable an erasure mode according to one representative embodiment. Portions of the flowchart of FIG. 3 may be implemented using executable instructions or software code for error analysis algorithm 203. In step 301, data corruption is detected during cache line retrieval from memory subsystem 100 by memory controller 101. In step 302, error analysis algorithm 203 is invoked. In step 303, error analysis algorithm 203 polls memory controller 101 to determine whether any instances of data corruption have occurred. In step 304, error records 204 are updated by error analysis algorithm 203. Specifically, the occurrences of the data corruption are recorded by error analysis algorithm 203. The occurrences of data corruption as detailed in error records 204 are time-stamped or otherwise associated with suitable temporal information. Also, old records of data corruption are erased (e.g., records that are older than twenty-four hours). The purpose of erasing records according to temporal information is that the reliability of memory components is time-dependent. That is, an expected number of transient errors is related to an observation period. Thus, the determination whether observed errors are indicative of transient errors or repeatable errors is facilitated by defining a consistent observation period through appropriate deletion of old records.
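By way of illustration only, the record update of step 304, including deletion of records older than the observation period, may be sketched as follows. The function name `update_records`, the tuple layout of a record, and the representation of error information as an opaque value are assumptions made for this sketch.

```python
from datetime import datetime, timedelta

OBSERVATION_WINDOW = timedelta(hours=24)  # example period given in the text


def update_records(records, new_errors, now, window=OBSERVATION_WINDOW):
    """Drop records older than the window, then append time-stamped new errors.

    `records` is a list of (timestamp, info) tuples; `new_errors` is a list of
    info values obtained by polling the memory controller.
    """
    cutoff = now - window
    kept = [(ts, info) for ts, info in records if ts >= cutoff]
    kept.extend((now, info) for info in new_errors)
    return kept
```

Keeping the window fixed means that a count of matching records can be compared against the number of transient errors expected over the same period.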

In step 305, a logical comparison is made to determine whether the current data corruption is consistent with prior bit patterns of data corruption as reflected in error records 204. If not, the process flow proceeds to step 308 where error analysis algorithm 203 ends. If the logical comparison of step 305 is true, the process flow proceeds to step 306.

In step 306, a lowest-level domain associated with the bit pattern associated with the current data corruption is determined. In one representative embodiment, the instances of data corruption for the bit pattern are examined to determine whether the patterns originated from (i) different addresses associated with a respective discrete DRAM bank; (ii) different addresses associated with a respective rank 107 of DRAM components; (iii) different addresses associated with a respective DRAM bus 106; (iv) different addresses associated with a respective quadrant bus 104; or (v) different addresses scattered across memory subsystem 100. From this analysis, the lowest-level domain is determined that encompasses all of the instances of data corruption. In step 307, the erasure mode is activated for the determined domain to erase bits associated with the observed bit pattern of data corruption. In step 308, the process flow ends.
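By way of illustration only, the determination of the lowest-level domain in step 306 may be sketched as follows. The `ErrorRecord` fields, the domain names, and the function name are hypothetical; the sketch assumes each record already identifies the quadrant, DRAM bus, rank, and bank involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorRecord:
    quadrant: int
    dram_bus: int
    rank: int
    bank: int
    address: int


# Domains ordered from lowest level (narrowest) to highest level (widest).
LEVELS = [
    ("bank", ("quadrant", "dram_bus", "rank", "bank")),
    ("rank", ("quadrant", "dram_bus", "rank")),
    ("dram_bus", ("quadrant", "dram_bus")),
    ("quadrant_bus", ("quadrant",)),
]


def lowest_level_domain(records):
    """Return (domain name, identifying key) for the narrowest shared domain."""
    if not records:
        raise ValueError("no error records to analyze")
    for name, fields in LEVELS:
        keys = {tuple(getattr(r, f) for f in fields) for r in records}
        if len(keys) == 1:          # every instance falls in the same component
            return name, keys.pop()
    return "memory_subsystem", ()   # instances scattered across the subsystem
```

Checking the narrowest level first limits the erasure mode to the smallest portion of the memory subsystem that accounts for every recorded instance of the pattern.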

FIG. 4 depicts another flowchart for analyzing data corruption associated with a memory subsystem using an ECC algorithm according to one representative embodiment. In step 401, errors in cache lines retrieved from the memory subsystem are detected using the ECC algorithm. In step 402, the errors are analyzed to detect a repeated bit pattern of data corruption within the cache lines. In step 403, the detected repeated bit pattern of data corruption is correlated to one of a plurality of domains of the memory subsystem. In step 404, the ECC algorithm is applied to erase bits associated with the detected repeated bit pattern from cache lines retrieved from the correlated domain of the memory subsystem.

It shall be appreciated that the memory subsystem architecture and functionality shown in FIG. 1 is by way of example only. Representative embodiments may enable selective activation of erasure mode ECC processing in any suitable memory subsystem. Even though one embodiment applies single-byte erasure error correction, representative embodiments can employ any suitable ECC scheme that enables ECC code words to be decoded by assuming identified bits are corrupted.

Representative embodiments enable a memory subsystem to be resilient against memory errors. Specifically, the correction of a repeated error and a single transient error within the same cache line appreciably reduces the probability that data corruption will cause an unrecoverable error or a system crash. Representative embodiments enable the correction of such errors by analyzing occurrences of data corruption and correlating those errors to specific components or portions of a memory subsystem. By performing such correlation, an erasure mode of an ECC algorithm may be applied. Furthermore, the erasure mode of the ECC algorithm can be limited to a specific subset of the memory subsystem thereby reducing performance limitations associated with the ECC erasure mode.

1. A method for controlling application of an erasure mode of an error correction code (ECC) algorithm in a memory subsystem, comprising: detecting errors in cache lines retrieved from said memory subsystem using said ECC algorithm; analyzing said errors to detect a repeated bit pattern of data corruption within said cache lines; correlating said detected repeated bit pattern of data corruption to one of a plurality of domains of said memory subsystem; and applying said ECC algorithm to erase bits associated with said detected repeated bit pattern from cache lines retrieved from said correlated domain of said memory subsystem.
2. The method of claim 1 wherein said applying said ECC algorithm erases said bits associated with said detected repeated bit pattern from cache lines retrieved from a memory rank.
3. The method of claim 1 wherein said applying said ECC algorithm erases said bits associated with said detected repeated bit pattern from cache lines communicated over a first bus coupled to a plurality of memory ranks.
4. The method of claim 1 wherein said applying said ECC algorithm erases said bits associated with said detected repeated bit pattern from cache lines communicated over a first bus coupled to a second bus and a third bus, wherein said second and third buses are coupled to pluralities of memory ranks.
5. The method of claim 1 wherein said bits to be erased are bits retrieved from a discrete memory bank.
6. The method of claim 1 wherein said ECC algorithm corrects an occurrence of a single-byte transient error in a cache line retrieved from said memory subsystem and, in said erasure mode, corrects an additional single-byte repeatable error corresponding to said erased bits.
7. The method of claim 1 wherein said applying comprises: loading a register of a memory controller identifying said bits associated with said repeated bit pattern to be erased.
8. The method of claim 1 wherein said analyzing, correlating, and applying are performed by software stored in firmware of a computer system.
9. The method of claim 1 wherein said ECC algorithm enables erasure of single-bytes from retrieved cache lines.